Abstract

This study explores the evolution of the language of Walt Disney Animation Studios movies over time through the computational lenses of topic modelling and sentiment analysis. We decided to focus on the spoken language of these motion pictures, which proved a complex problem to tackle, given the peculiarity of oral dialogue, especially in pictures aimed at children.

From the start we focused on a computational method that could give us insight into the peculiarities of each movie and, possibly, into parallels or similarities between movies from different historical moments.


Introduction

The choice of subject stems from a deep fascination with the imaginary worlds crafted by Walt Disney Animation Studios and with the language used in the company’s motion pictures. Since these movies span almost a hundred years, ranging from classic to contemporary, the corpus was also a perfect opportunity to study children-centric language from a diachronic point of view. The challenge mainly consisted in normalizing and treating such orally-bound, children-specific language in a way that allowed us to gather meaningful insight.

Since the object of our study was a collection of oral texts extracted from movies, we quickly identified several challenges, such as the oral nature of these texts and the widespread use of narrative devices, such as flashbacks, that complicate computational processing. We needed an additional analysis that would take this specificity into consideration and thus offer a better understanding of meaning and plot implications.

When we took up the project we chose to include every motion picture released by Walt Disney Animation Studios up to that moment: 59 movies, released between 1937 and 2021, were therefore included.


After gathering and cleaning the textual data, two main experiments were carried out:
1. We first used a tool called MALLET - a Java-based package for natural language processing - to extract clusters of topics from the movies, finally allowing us to group motion pictures by theme.
2. Concurrently, we experimented with the Syuzhet library for the R programming language to compare the differences - or lack thereof - in sentiment valence among movies belonging to the same thematic cluster.
TODO: Conclusion (we found that…)

What is Mallet?

From their documentation:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

This package was the starting point of our analysis, as it allowed us - interested humanists - to rise to the challenge of complex computational analyses without the steep learning curve of more advanced tools. Its CLI - command-line interface, the kind of application or toolkit one uses via terminal commands - struck the perfect balance between control over the analysis and output, and complex computations that would otherwise require programming knowledge in Java.

Additionally, the toolkit is open-source software, released under the Apache 2.0 License, and widely used in the field, which also means there is a large number of resources and solutions to common problems online.

What is Syuzhet?

Syuzhet is one of the two terms describing narrative composition - along with fabula - theorized by the Russian Formalists Victor Shklovsky and Vladimir Propp. It refers to the “device” or technique of a narrative and is concerned with the manner in which the components of a story are organized. It is also the name chosen for an R package specifically targeted at natural language processing analyses. Its main goal is making NLP, and especially sentiment analysis of textual data, widely available in a simple and direct way.
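The core idea - scoring each sentence and tracking emotional valence across the narrative - can be sketched in Python with a toy lexicon (the word lists below are illustrative only; Syuzhet ships real dictionaries such as "bing", "afinn" and "nrc"):

```python
import re

# Toy valence lexicon (illustrative; not Syuzhet's actual dictionaries)
POSITIVE = {"love", "happy", "friend", "wonderful", "dream"}
NEGATIVE = {"fear", "curse", "alone", "dark", "wicked"}

def sentence_valence(sentence: str) -> int:
    """Score one sentence: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def trajectory(text: str) -> list:
    """Split a text into sentences and return the valence of each one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence_valence(s) for s in sentences]

story = ("Once she was happy with her friend. "
         "Then a wicked curse left her alone in the dark. "
         "Love broke the curse!")
print(trajectory(story))  # → [2, -4, 0]
```

The resulting list of scores is the raw "emotional arc" that Syuzhet then smooths and plots.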


Web Scraping

After deciding on the time window of reference for the research, spanning from 1937 (the year Snow White and the Seven Dwarfs was released) to 2021 (the year this research first started), we needed to gather all relevant titles. The Wikipedia page for Disney movies felt like the perfect place to start. We downloaded the HTML page using the requests module for Python and subsequently parsed the document tree with BeautifulSoup, an XML and HTML parsing library for Python.

from typing import List

from bs4 import BeautifulSoup
from bs4.element import PageElement
import json
import requests

# Wikipedia page for "Disney Movies"
DISNEY_URI = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Animation_Studios_films"

# Retrieve the webpage in HTML format
response = requests.get(DISNEY_URI)

if response.status_code == 200:
    # Keep a local copy of the page
    with open("disney_titles.html", "w") as fp:
        fp.write(response.text)
    # Turn the HTML string into a Soup object
    soup = BeautifulSoup(response.text, features="lxml")
    # Retrieve all table rows containing movie titles
    rows: List[PageElement] = soup.find_all("tr")

    disney_movies = dict()
    # Find the first td element (title) and the second td element (year)
    # and aggregate them in a dict object
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < 2:  # skip header rows and malformed rows
            continue
        title = cells[0].text.strip("\n")
        year = cells[1].text.strip("\n").replace("\xa0", " ")  # normalize non-breaking spaces
        disney_movies[title] = {"release_date": year}

    # Save the dict as disney_titles.json
    with open("disney_titles.json", "w") as outfile:
        json.dump(disney_movies, outfile)

Once this step was over we had a JSON file mapping each movie title to its release year, in the format { "Snow White and the Seven Dwarfs": {"release_date": "1937"} }. To gather the subtitles for these movies we needed an open collection of subtitles, and OpenSubtitles’ service fit our needs perfectly. It provides an open REST API, so after obtaining a key and getting comfortable with the documentation we quickly turned the JSON list of titles into a folder of .srt files. .srt files are very easy to work with, since they are written in plain text and the formatting is very predictable. Since at this stage we were working in Python, we cleaned the raw subtitles using a Python library called pysrt, which proved essential for extracting textual data from the .srt files. Concurrently, we noticed that many texts were riddled with HTML tags, descriptions of surroundings, advertisements and so on; while gathering the texts we therefore also started cleaning them. This is one of the functions used to remove unwanted textual data from our subtitles:

from typing import Dict
import os
import re

import pysrt


def parse_subs() -> Dict[str, list]:
    """
    Turn subtitle files into an object:
    {"movie_name": ["YEAR", "text"], ...}
    """
    subs_directory = "subs/"
    final_object = {}
    for file in os.listdir(subs_directory):
        # Parse the .srt file for easier handling
        try:
            srt = pysrt.open(subs_directory + file)
        except UnicodeDecodeError:
            print(f"Error handling file: {file}\nSkipping...")
            continue

        # Remove OpenSubtitles ads and intros
        opensubs_ads = r'(♪)|(Advertise your product or brand here)|(contact www\.OpenSubtitles\.(org|com) today)|(Support us and become VIP member)|(to remove all ads from www\.OpenSubtitles\.(org|com))|(-== \[ www\.OpenSubtitles\.(org|com) \] ==-)|((((Subtitles by )|(Sync by ))(.+))$)|(font color="(.+)?")|(Provided by(.+)$)|(^(https?):\/\/[^\s\/$.?#].[^\s]*$)|(Please rate this subtitle at (.)+$)|(Help other users to choose the best subtitles)'
        remove_ads = re.sub(opensubs_ads, "", srt.text)
        # Remove curly-brace styling, HTML tags (opening and closing),
        # dashes (dialogue markers) and returns
        remove_curly = re.sub(r"\{.*?\}", "", remove_ads)
        remove_html = re.sub(r"(<[^>]+>)+", " ", remove_curly)
        remove_dashes = re.sub(r"-\s", " ", remove_html)
        remove_returns = re.sub(r"[\r\t\n]", " ", remove_dashes)
        # Collapse runs of whitespace and trim
        cleaned = re.sub(r"\s+", " ", remove_returns).strip()

        # File names follow the pattern Title_YEAR.srt
        year = file.split("_")[-1].removesuffix(".srt")
        title = "_".join(file.split("_")[:-1])

        final_object[title] = [year, cleaned]

    return final_object
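To illustrate, the cleaning steps can be exercised on a fabricated subtitle fragment; a minimal sketch (the sample text is ours, and the ad pattern below is a shortened subset of the full pattern used in the function above):

```python
import re

# Shortened subset of the OpenSubtitles ad pattern, for illustration only
ads = r"(Advertise your product or brand here)|(Support us and become VIP member)"

sample = "<i>- Heigh-ho!</i>\n{pos(10,10)}Advertise your product or brand here"

step1 = re.sub(ads, "", sample)             # drop OpenSubtitles ads
step2 = re.sub(r"\{.*?\}", "", step1)       # drop curly-brace styling
step3 = re.sub(r"(<[^>]+>)+", " ", step2)   # drop HTML tags
step4 = re.sub(r"-\s", " ", step3)          # drop dialogue dashes
step5 = re.sub(r"[\r\t\n]", " ", step4)     # flatten line breaks
clean = re.sub(r"\s+", " ", step5).strip()  # collapse whitespace

print(clean)  # → Heigh-ho!
```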

Finally, we were done scraping and cleaning data. At this point the output of this first round was pickled (serialized with Python’s built-in pickle module) for future manipulation and saved as the first dataset.
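The serialization step itself is short; a minimal sketch (the file name disney_dataset.pkl and the sample entry are ours, for illustration):

```python
import pickle

# Hypothetical output of parse_subs(): {title: [year, cleaned_text]}
dataset = {"Snow_White_and_the_Seven_Dwarfs": ["1937", "heigh-ho heigh-ho"]}

# Serialize to disk for the later stages of the pipeline
with open("disney_dataset.pkl", "wb") as fp:
    pickle.dump(dataset, fp)

# Deserialize to verify the round trip
with open("disney_dataset.pkl", "rb") as fp:
    restored = pickle.load(fp)

print(restored == dataset)  # → True
```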


2. Topic Modelling

2.1 Data Pre-processing

After the data had been scraped and passed through a first cleaning stage, we made further adjustments to optimize Mallet’s tasks.

The input files for Mallet are copies of the web-scraped texts from which personal and organization names have been removed using spaCy’s en_core_web_sm model, since in previous trials with Mallet we had found them to be noisy.

Further cleaning deployed NLTK’s POS tagger to keep only those words labeled NN (i.e., nouns) and longer than four characters; a regex was also added to remove words with apostrophes (e.g., “ya’ll”) that were missed by both spaCy’s parser and NLTK’s tagger.

The resulting files were saved into the directory cartoonlp/nn_txts.

from nltk.tokenize import word_tokenize
import spacy
import os
import nltk
import re


NER = spacy.load("en_core_web_sm")
path = "nn_txts/"
for file in os.listdir("./txts"):
    with open("./txts/" + file, "r") as new_file:
        text = new_file.read()
        # Automatic detection of person and organization names to remove
        parsed = NER(text)
        for ent in parsed.ents:
            if ent.label_ in ("PERSON", "ORG"):
                text = text.replace(str(ent), "")
        tokens = word_tokenize(text)
        tagged = nltk.pos_tag(tokens)
        # Keep only words tagged as nouns and longer than 4 characters
        stripped_text = [word for word, tag in tagged if tag == "NN" and len(word) > 4]
        # Remove words with apostrophes, such as contracted pronouns
        # (filtering instead of removing while iterating, which skips items)
        stripped_text = [word for word in stripped_text if not re.search(r"\w+[']\w+?", word)]

        new_string = " ".join(stripped_text)

        with open(path + file, "w") as out_file:
            out_file.write(new_string)


2.2 Data Processing

The topic modelling was run from the shell with Mallet.

First we imported the pre-processed files in the nn_txts directory into Mallet, removing any detected English stop words with the --remove-stopwords flag.

mallet import-dir \
  --input sample-data/nn_txts \
  --output disney_topics.mallet \
  --keep-sequence --remove-stopwords


Then followed an exploratory phase, aimed at understanding which parameters were most appropriate for a corpus as small as ours, after which the command was refined into the final form shown below. Here we iterated Mallet multiple times over the corpus, changing parameters as, in our judgment, the cluster quality improved.

We detected as useful input parameters for train-topics:

  • --num-topics: the number of topics Mallet should retrieve

  • --optimize-burn-in: the number of iterations before hyper-parameter optimization begins. The default is twice the optimize interval.

We started with a low number of topics (6) and increased it up to 15, at which point we decided the clusters were satisfying: each cluster was intelligible and homogeneous, and its words made sense with the movies they were assigned to.

The burn-in was also raised to 60, since we noticed it helped with topic homogenization.
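This kind of parameter exploration can be scripted as a dry run that prints each candidate command before committing to one (a sketch; the topic counts and output file names are ours):

```shell
# Print the train-topics command for each candidate topic count (dry run);
# remove the echo to actually execute them.
for k in 6 9 12 15; do
  echo "mallet train-topics --input disney_topics.mallet --num-topics $k --optimize-burn-in 60 --output-topic-keys disney_keys_$k.txt"
done > sweep_commands.txt
cat sweep_commands.txt
```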

mallet train-topics --input disney_topics.mallet \
  --num-topics 15 \
  --optimize-burn-in 60 \
  --output-state disney-topic-state.gz \
  --output-topic-keys disney_keys.txt \
  --output-doc-topics disney_composition.csv \
  --xml-topic-report disney_report.xml


##   Year                               Title           T1           T2
## 1 1937 Snow_White_and_the_Seven_Dwarfs.txt 0.0013831259 0.0677731674
## 2 1940                   Fantasia_2000.txt 0.0977011494 0.1666666667
## 3 1940                       Pinocchio.txt 0.0008012821 0.0248397436
## 4 1941                           Dumbo.txt 0.0011037528 0.0242825607
## 5 1942                           Bambi.txt 0.0024691358 0.0024691358
## 6 1942                  Saludos_Amigos.txt 0.0051057622 0.0007293946
##             T3          T4          T5          T6          T7         T8
## 1 0.0013831259 0.005532503 0.022130014 0.001383126 0.001383126 0.22544952
## 2 0.0028735632 0.002873563 0.002873563 0.011494253 0.002873563 0.03735632
## 3 0.0128205128 0.017628205 0.039262821 0.017628205 0.491185897 0.13782051
## 4 0.0309050773 0.011037528 0.037527594 0.044150110 0.001103753 0.24282561
## 5 0.0024691358 0.009876543 0.024691358 0.002469136 0.002469136 0.18765432
## 6 0.0007293946 0.020423049 0.029175784 0.016046681 0.026987600 0.03136397
##            T9          T10          T11          T12        T13         T14
## 1 0.242047026 0.0013831259 0.0013831259 0.0013831259 0.14246196 0.183955740
## 2 0.002873563 0.0373563218 0.0632183908 0.1063218391 0.45977011 0.002873563
## 3 0.056089744 0.0008012821 0.0008012821 0.0008012821 0.14983974 0.048878205
## 4 0.020971302 0.4116997792 0.0275938190 0.0011037528 0.03421634 0.007726269
## 5 0.009876543 0.0098765432 0.0024691358 0.0246913580 0.50617284 0.017283951
## 6 0.048869438 0.4668125456 0.0335521517 0.1889132020 0.09919767 0.022611233
##            T15
## 1 0.1009681881
## 2 0.0028735632
## 3 0.0008012821
## 4 0.1037527594
## 5 0.1950617284
## 6 0.0094821298

Movies in mallet_values.csv were chronologically ordered and plotted as a stacked bar chart.
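The chronological ordering step can be sketched in Python (the rows below are fabricated, mimicking the Year/Title/T1… layout of mallet_values.csv):

```python
# Fabricated rows mimicking mallet_values.csv: (Year, Title, T1, T2, ...)
rows = [
    (1950, "Cinderella.txt", 0.01, 0.21),
    (1937, "Snow_White_and_the_Seven_Dwarfs.txt", 0.00, 0.07),
    (1940, "Pinocchio.txt", 0.00, 0.02),
]

# Sort chronologically before plotting the stacked bars
rows.sort(key=lambda r: r[0])
print([title for _, title, *_ in rows])
# → ['Snow_White_and_the_Seven_Dwarfs.txt', 'Pinocchio.txt', 'Cinderella.txt']
```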

2.3 Topics analysis


First we look at the count of movies for each topic, to see how movies are distributed among topics; to do so we plotted a line chart for each topic. We noticed that every topic has a few movies whose weight value is above 0.20, and we selected this number as the minimum threshold for deciding which movie to include in which cluster.

\[insert screenshot of the Numbers charts\]

#Create columns for movies' release dates and titles
date <- paste(mallet_values$Year)
movie <- paste(mallet_values$Title)
#Create columns with binary values for each topic -- example here with T1
T1 <- mallet_values$T1
vT1 <- 0.20
Topic1 <- vector()
for (v in T1) {
  if (v >= vT1) {
    Topic1 <- c(Topic1, 1)
  } else {
    Topic1 <- c(Topic1, 0)
  }
}
print(Topic1)
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
#create matrix 
matrix<-data.frame(date,movie,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,Topic10,Topic11,Topic12,Topic13,Topic14,Topic15)
matrix
##    date                                      movie Topic1 Topic2 Topic3 Topic4
## 1  1937        Snow_White_and_the_Seven_Dwarfs.txt      0      0      0      0
## 2  1940                          Fantasia_2000.txt      0      0      0      0
## 3  1940                              Pinocchio.txt      0      0      0      0
## 4  1941                                  Dumbo.txt      0      0      0      0
## 5  1942                                  Bambi.txt      0      0      0      0
## 6  1942                         Saludos_Amigos.txt      0      0      0      0
## 7  1944                   The_Three_Caballeros.txt      0      0      0      0
## 8  1946                        Make_Mine_Music.txt      0      0      0      0
## 9  1947                     Fun_and_Fancy_Free.txt      0      0      0      0
## 10 1949 The_Adventures_of_Ichabod_and_Mr._Toad.txt      0      0      0      0
## 11 1950                             Cinderella.txt      0      0      0      0
## 12 1951                    Alice_in_Wonderland.txt      0      0      0      0
## 13 1953                              Peter_Pan.txt      0      0      1      0
## 14 1955                     Lady_and_the_Tramp.txt      0      0      0      0
## 15 1958                            Melody_Time.txt      0      0      0      0
## 16 1959                        Sleeping_Beauty.txt      0      1      0      0
## 17 1961         One_Hundred_and_One_Dalmatians.txt      0      0      0      0
## 18 1963                 The_Sword_in_the_Stone.txt      0      1      0      0
## 19 1967                        The_Jungle_Book.txt      1      0      0      0
## 20 1970                         The_Aristocats.txt      0      0      0      1
## 21 1973                             Robin_Hood.txt      0      0      0      1
## 22 1977 The_Many_Adventures_of_Winnie_the_Pooh.txt      0      0      0      0
## 23 1977                The_Rescuers_Down_Under.txt      0      0      1      0
## 24 1981                  The_Fox_and_the_Hound.txt      0      0      0      0
## 25 1985                     The_Black_Cauldron.txt      0      0      0      0
## 26 1986              The_Great_Mouse_Detective.txt      0      0      0      0
## 27 1988                       Oliver_&_Company.txt      0      0      0      1
## 28 1989                     The_Little_Mermaid.txt      0      0      0      0
## 29 1990                           The_Rescuers.txt      0      0      1      0
## 30 1991                   Beauty_and_the_Beast.txt      0      0      0      0
## 31 1992                                Aladdin.txt      0      0      0      1
## 32 1994                          The_Lion_King.txt      0      0      0      0
## 33 1995                             Pocahontas.txt      0      0      0      0
## 34 1996            The_Hunchback_of_Notre_Dame.txt      0      1      0      0
## 35 1997                               Hercules.txt      0      0      0      0
## 36 1998                                  Mulan.txt      0      0      0      0
## 37 1999                               Fantasia.txt      0      0      0      0
## 38 1999                                 Tarzan.txt      0      0      0      0
## 39 2000                               Dinosaur.txt      0      0      0      0
## 40 2000               The_Emperor's_New_Groove.txt      0      0      1      0
## 41 2001               Atlantis_The_Lost_Empire.txt      0      0      0      0
## 42 2002                        Treasure_Planet.txt      0      0      1      0
## 43 2003                           Brother_Bear.txt      0      0      0      0
## 44 2004                      Home_on_the_Range.txt      0      0      0      1
## 45 2005                         Chicken_Little.txt      0      0      0      0
## 46 2007                          Lilo_&_Stitch.txt      0      0      0      0
## 47 2007                     Meet_the_Robinsons.txt      0      0      0      0
## 48 2008                                   Bolt.txt      0      0      0      0
## 49 2009              The_Princess_and_the_Frog.txt      0      0      0      0
## 50 2010                                Tangled.txt      0      1      0      0
## 51 2011                        Winnie_the_Pooh.txt      0      0      0      0
## 52 2012                         Wreck-It_Ralph.txt      1      0      0      0
## 53 2013                              Frozen_II.txt      0      0      0      0
## 54 2014                             Big_Hero_6.txt      0      0      0      0
## 55 2016                                  Moana.txt      0      0      0      0
## 56 2016                               Zootopia.txt      0      0      0      0
## 57 2018              Ralph_Breaks_the_Internet.txt      1      0      0      0
## 58 2019                                 Frozen.txt      0      0      0      0
## 59 2021               Raya_and_the_Last_Dragon.txt      0      0      0      0
##    Topic5 Topic6 Topic7 Topic8 Topic9 Topic10 Topic11 Topic12 Topic13 Topic14
## 1       0      0      0      1      1       0       0       0       0       0
## 2       0      0      0      0      0       0       0       0       1       0
## 3       0      0      1      0      0       0       0       0       0       0
## 4       0      0      0      1      0       1       0       0       0       0
## 5       0      0      0      0      0       0       0       0       1       0
## 6       0      0      0      0      0       1       0       0       0       0
## 7       0      0      0      0      0       1       0       0       0       0
## 8       0      0      0      0      0       0       0       0       1       0
## 9       0      0      0      0      0       0       0       0       0       1
## 10      0      0      0      0      0       0       0       1       0       0
## 11      0      0      0      1      0       0       0       0       1       0
## 12      0      0      0      0      0       0       0       0       0       1
## 13      0      0      0      1      0       0       0       0       0       0
## 14      0      0      0      1      0       0       0       0       0       0
## 15      1      0      0      0      0       0       0       0       0       0
## 16      0      0      0      0      0       0       0       0       0       0
## 17      0      0      1      1      0       0       0       0       0       0
## 18      0      0      0      0      0       0       0       0       0       0
## 19      0      0      0      1      0       0       0       0       0       0
## 20      0      0      0      1      0       0       0       0       0       0
## 21      0      0      0      0      0       0       0       0       0       0
## 22      0      1      0      1      0       0       0       0       0       0
## 23      0      0      0      0      0       0       0       0       0       0
## 24      0      0      0      1      0       0       0       0       0       0
## 25      1      0      0      0      0       0       0       0       0       1
## 26      0      1      0      0      0       0       0       0       0       0
## 27      0      0      0      0      0       0       0       0       0       0
## 28      0      0      0      0      0       0       0       0       0       0
## 29      0      0      0      1      0       0       0       0       0       0
## 30      0      0      0      0      0       0       0       0       0       1
## 31      0      0      0      0      0       0       0       0       0       0
## 32      0      0      0      1      1       0       0       0       0       0
## 33      1      0      0      0      1       0       0       0       0       0
## 34      0      0      0      0      0       0       0       0       0       0
## 35      1      0      0      0      0       0       0       0       1       0
## 36      1      0      0      0      0       0       0       0       0       0
## 37      0      0      0      0      0       0       0       0       1       0
## 38      1      0      0      1      1       0       0       0       0       0
## 39      0      1      0      0      1       0       0       0       0       0
## 40      0      0      0      0      0       0       0       0       0       0
## 41      0      0      0      0      0       0       0       1       0       0
## 42      0      0      0      0      0       0       0       0       0       0
## 43      0      0      0      0      1       0       0       0       0       0
## 44      0      0      0      0      0       0       0       0       0       0
## 45      0      0      0      0      0       0       1       0       0       0
## 46      0      0      0      1      0       1       0       0       0       0
## 47      0      0      0      0      0       0       1       0       0       0
## 48      1      0      0      0      0       0       0       0       0       0
## 49      0      0      0      0      0       0       0       0       0       0
## 50      0      0      0      0      0       0       0       0       0       0
## 51      0      1      0      0      0       0       0       0       0       0
## 52      0      0      0      0      0       0       0       0       0       0
## 53      1      0      0      0      0       0       0       0       0       0
## 54      0      0      0      0      0       0       1       0       0       0
## 55      0      0      0      0      1       0       0       0       0       0
## 56      0      0      1      0      0       0       0       0       0       0
## 57      0      0      0      0      0       0       0       0       0       0
## 58      1      0      0      0      0       0       0       0       0       0
## 59      0      0      0      0      0       0       0       0       0       0
##    Topic15
## 1        0
## 2        0
## 3        0
## 4        0
## 5        0
## 6        0
## 7        0
## 8        0
## 9        0
## 10       0
## 11       0
## 12       0
## 13       0
## 14       0
## 15       0
## 16       0
## 17       0
## 18       0
## 19       0
## 20       0
## 21       0
## 22       0
## 23       0
## 24       0
## 25       0
## 26       0
## 27       0
## 28       0
## 29       0
## 30       0
## 31       1
## 32       0
## 33       0
## 34       0
## 35       0
## 36       0
## 37       0
## 38       0
## 39       0
## 40       1
## 41       0
## 42       0
## 43       0
## 44       0
## 45       0
## 46       0
## 47       0
## 48       0
## 49       1
## 50       0
## 51       0
## 52       0
## 53       1
## 54       0
## 55       0
## 56       0
## 57       0
## 58       0
## 59       1

Movies in T1


movies <- c()
dates <- c()
t_weight <- c()
for (i in rownames(matrix)) {
  title <- matrix[i, "movie"]
  date <- matrix[i, "date"]
  row <- mallet_values[match(title, mallet_values$Title) + 1, ]
  w <- row$T1
  if (matrix[i, "Topic1"] == 1) {
    movies <- c(movies, title)
    dates <- c(dates, date)
    t_weight <- c(t_weight, w)
  }
}
 
cluster_T1 <-data.frame(movies, dates, t_weight)
cluster_T1
##                          movies dates   t_weight
## 1           The_Jungle_Book.txt  1967 0.05939716
## 2            Wreck-It_Ralph.txt  2012 0.03624901
## 3 Ralph_Breaks_the_Internet.txt  2018 0.02752976

thing friend medal jungle wheel video arcade stuff track racer glitch building today virus man-village buddy inurity march credit princess


Movies in T2
##                            movies dates  t_weight
## 1             Sleeping_Beauty.txt  1959 0.4325530
## 2      The_Sword_in_the_Stone.txt  1963 0.4678250
## 3 The_Hunchback_of_Notre_Dame.txt  1996 0.2336011
## 4                     Tangled.txt  2010 0.5279748

dream world birthday child kingdom tower magic stone sword power witch crown blood flower story miracle tomorrow gleam light castle

Movies in T3
##                         movies dates  t_weight
## 1                Peter_Pan.txt  1953 0.4651852
## 2  The_Rescuers_Down_Under.txt  1977 0.3828829
## 3             The_Rescuers.txt  1990 0.4395712
## 4 The_Emperor's_New_Groove.txt  2000 0.3060960
## 5          Treasure_Planet.txt  2002 0.5076209

captain treasure diamond emperor pirate llama world order silver leader shadow flight singing house woman cyborg career chief shirt cliff


Movies in T4
##                  movies dates  t_weight
## 1    The_Aristocats.txt  1970 0.2641844
## 2        Robin_Hood.txt  1973 0.3610411
## 3  Oliver_&_Company.txt  1988 0.2983947
## 4           Aladdin.txt  1992 0.2201722
## 5 Home_on_the_Range.txt  2004 0.4520104

money street sheriff woman mouth kitty uncle range horse carpet reward partner trail alley sultan outta church minute property permission


Movies in T5
##                   movies dates  t_weight
## 1        Melody_Time.txt  1958 0.3809524
## 2 The_Black_Cauldron.txt  1985 0.4414414
## 3         Pocahontas.txt  1995 0.3390640
## 4           Hercules.txt  1997 0.2526455
## 5              Mulan.txt  1998 0.4562963
## 6             Tarzan.txt  1999 0.2733333
## 7               Bolt.txt  2008 0.3848712
## 8          Frozen_II.txt  2013 0.3672183
## 9             Frozen.txt  2019 0.4784226

thing heart family father chance river sister moment truth point daughter fault spirit death danger strength question choice sword place


Movies in T6
##                                       movies dates  t_weight
## 1 The_Many_Adventures_of_Winnie_the_Pooh.txt  1977 0.5537998
## 2              The_Great_Mouse_Detective.txt  1986 0.4223541
## 3                               Dinosaur.txt  2000 0.2229039
## 4                        Winnie_the_Pooh.txt  2011 0.5779645

thing honey fellow goodness friend doctor house moment tummy narrator brain queen stuff mouse thought prize sense chapter bother bottle


Movies in T7
##                               movies dates  t_weight
## 1                      Pinocchio.txt  1940 0.4911859
## 2 One_Hundred_and_One_Dalmatians.txt  1961 0.3540490
## 3                       Zootopia.txt  2016 0.5849822

bunny savage world father conscience actor school plenty predator otter crime officer chain number couple whale traffic alert police system


Movies in T8
##                                        movies dates  t_weight
## 1         Snow_White_and_the_Seven_Dwarfs.txt  1937 0.2254495
## 2                                   Dumbo.txt  1941 0.2428256
## 3                              Cinderella.txt  1950 0.2100089
## 4                               Peter_Pan.txt  1953 0.2318519
## 5                      Lady_and_the_Tramp.txt  1955 0.5228938
## 6          One_Hundred_and_One_Dalmatians.txt  1961 0.4105461
## 7                         The_Jungle_Book.txt  1967 0.3361742
## 8                          The_Aristocats.txt  1970 0.2349291
## 9  The_Many_Adventures_of_Winnie_the_Pooh.txt  1977 0.2513168
## 10                  The_Fox_and_the_Hound.txt  1981 0.3776758
## 11                           The_Rescuers.txt  1990 0.2202729
## 12                          The_Lion_King.txt  1994 0.3257230
## 13                                 Tarzan.txt  1999 0.2093333
## 14                          Lilo_&_Stitch.txt  2007 0.2067968

place night mother minute morning friend thing matter trouble house surprise hurry business earth tonight creature goodness charge today devil


Movies in T9
##                                movies dates  t_weight
## 1 Snow_White_and_the_Seven_Dwarfs.txt  1937 0.2420470
## 2                   The_Lion_King.txt  1994 0.2663623
## 3                      Pocahontas.txt  1995 0.2244508
## 4                          Tarzan.txt  1999 0.2813333
## 5                        Dinosaur.txt  2000 0.3087935
## 6                    Brother_Bear.txt  2003 0.4429224
## 7                           Moana.txt  2016 0.6515152

heart water brother island village mountain world ocean voice monster share story earth stuff mission board chicken journey darkness ground


Movies in T10
##                     movies dates  t_weight
## 1                Dumbo.txt  1941 0.4116998
## 2       Saludos_Amigos.txt  1942 0.4668125
## 3 The_Three_Caballeros.txt  1944 0.5375218
## 4        Lilo_&_Stitch.txt  2007 0.3868402

gaucho plane circus angel motion elephant climax planet samba potato peanut shelter knife picture lilongo stand saddle roller stitch feather


Movies in T11
##                   movies dates  t_weight
## 1     Chicken_Little.txt  2005 0.3483299
## 2 Meet_the_Robinsons.txt  2007 0.6327313
## 3         Big_Hero_6.txt  2014 0.5737110

future today machine family science school buddy garage chance cover story class question invention baseball robot project problem companion control


Movies in T12
##                                       movies dates  t_weight
## 1 The_Adventures_of_Ichabod_and_Mr._Toad.txt  1949 0.4727815
## 2               Atlantis_The_Lost_Empire.txt  2001 0.3826788

dream bridge grandfather crystal adventure power round excitement motorcar paper schoolmaster source price court country mania language flight police decision


Movies in T13
##                movies dates  t_weight
## 1   Fantasia_2000.txt  1940 0.4597701
## 2           Bambi.txt  1942 0.5061728
## 3 Make_Mine_Music.txt  1946 0.5899796
## 4      Cinderella.txt  1950 0.4781055
## 5        Hercules.txt  1997 0.2189153
## 6        Fantasia.txt  1999 0.6557971

music heart story hurry dress spring number stuff dream window sound slipper beauty romance picture country matter tonight sweet glass


Movies in T14
##                     movies dates  t_weight
## 1   Fun_and_Fancy_Free.txt  1947 0.2018197
## 2  Alice_in_Wonderland.txt  1951 0.6504254
## 3   The_Black_Cauldron.txt  1985 0.2136422
## 4 Beauty_and_the_Beast.txt  1991 0.5483405

beast master castle monster father world watch party rabbit trouble afternoon child apple chance dinner pardon fault spell advice guest


Movies in T15
##                          movies dates  t_weight
## 1                   Aladdin.txt  1992 0.2681427
## 2  The_Emperor's_New_Groove.txt  2000 0.2166018
## 3 The_Princess_and_the_Frog.txt  2009 0.4825248
## 4                 Frozen_II.txt  2013 0.3435776
## 5  Raya_and_the_Last_Dragon.txt  2021 0.5064836

prince world princess magic water voice dragon future night palace forest today daughter sense problem light thing reason bayou restaurant


Findings

We found that some clusters are not particularly meaningful, such as T8, which is understandable given the oral nature of the language: MALLET can produce noisy results on such data. Fortunately, other clusters did make sense, and we made the following observations:

Topic 2: classic fairy-tale language, most present around the 1960s; it returns prominently in 2010.

Sentiment: compare 1959 and 1963, which are close in time, then see what changes in 2010.

Topic 3: adventure; it appears in 1953, 1990 and 2002.

Topic 5: 1997-1999; two films with female and two with male protagonists, whose topic weights appear correlated with the protagonists' gender.

T8: we explain what it could represent, but without running syuzhet on it: oral-bound language, exclamations and general cartoon vocabulary.

T9: wandering and the perception of nature; a strong increase from 1999 to 2016, to be checked against the sentiment.

T11: high tech; see how it is perceived in the dialogue and compare 2007 and 2014.

T13: fairy-tale language; the films in this cluster share a soundtrack that dominates over the text (Bambi, Make Mine Music and Cinderella) and are close in years, so we check whether there is a correlation within the cluster.

T14: fantasy world and magic; compare 1951 and 1991.

T15: emancipated princesses: 2009, 2013, 2021.
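To visualise how a topic's prevalence evolves over time, the per-movie weights reported in the tables above can be plotted against release year. A minimal sketch in base R, where t9 is a data frame we rebuild by hand from the T9 output above:

```r
# Rebuild the T9 table as a data frame (values copied from the output above)
t9 <- data.frame(
  movies = c("Snow_White_and_the_Seven_Dwarfs", "The_Lion_King", "Pocahontas",
             "Tarzan", "Dinosaur", "Brother_Bear", "Moana"),
  dates = c(1937, 1994, 1995, 1999, 2000, 2003, 2016),
  t_weight = c(0.2420470, 0.2663623, 0.2244508, 0.2813333,
               0.3087935, 0.4429224, 0.6515152)
)

# Plot topic weight against release year to see the diachronic trend
plot(t9$dates, t9$t_weight,
     type = "b", pch = 19, col = "darkgreen",
     xlab = "Release year", ylab = "Topic 9 weight")
```

The same pattern applies to any of the other clusters by substituting the corresponding table.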



3. Syuzhet analysis


First, we import the syuzhet package and read the CSV file containing all the films with their tokenized sentences.

library(syuzhet)   # loads the syuzhet package
library(dplyr)     # enables glimpse()
library(rmarkdown) # for pretty prints

df <- read.csv(url("https://raw.githubusercontent.com/fcagnola/cartoonlp/main/03_out_dataframe.csv"))

glimpse(df)
## Rows: 59
## Columns: 5
## $ X                  <chr> "Chicken_Little", "Frozen", "Bolt", "Fantasia_2000"…
## $ Year               <int> 2005, 2013, 2008, 1999, 1949, 1940, 2012, 1942, 200…
## $ Text               <chr> "Now, where to begin? How 'bout, ''Once upon a time…
## $ Sentence_Tokenized <chr> "['Now, where to begin?', \"How 'bout, ''Once upon …
## $ Tokenized          <chr> "['now', 'begin', 'how', \"'bout\", '``', 'once', '…


The second step was readjusting a copy of the data frame to fit our purpose: keeping only the title, year and text columns, renaming them, sorting by year, and adding a word count for each script.

texts_df <- df[, c("X", "Year", "Text")]
texts_df <- texts_df %>% rename(Title = X)
texts_df <- texts_df %>% arrange(Year)

# add a word count for each script
for (i in rownames(texts_df)) {
  string <- texts_df[i, "Text"]
  count <- lengths(gregexpr("\\W+", string)) + 1
  texts_df[i, "Length"] <- count
}

An example of the final data frame is illustrated here
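The preview referred to above can be reproduced, for example, with rmarkdown's table printer (assuming the texts_df data frame built in the previous chunk):

```r
# Print the first rows of the reshaped data frame:
# one row per film, sorted by year
rmarkdown::paged_table(head(texts_df, 5))
```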



Experiment on T2

Text processed during the scraping phase is retrieved from the texts_df data frame.

text1 <- "Sleeping_Beauty"
row_1 <- texts_df[match(text1, texts_df$Title), ]
string_1 <- row_1$Text

text2 <- "The_Sword_in_the_Stone"
row_2 <- texts_df[match(text2, texts_df$Title), ]
string_2 <- row_2$Text

text3 <- "Tangled"
row_3 <- texts_df[match(text3, texts_df$Title), ]
string_3 <- row_3$Text


Calculating the sentiment scores of the three texts using the syuzhet library and its default method:

v_1<- get_sentences(string_1)
v_2 <- get_sentences(string_2)
v_3 <- get_sentences(string_3)

sv_1<- get_sentiment(v_1, method="syuzhet")
sv_2<- get_sentiment(v_2, method="syuzhet")
sv_3<- get_sentiment(v_3, method="syuzhet")


Next, we calculate a moving average for each vector of raw values, using a window size equal to 1/10 of the overall length of the vector; the rolled vectors are then rescaled with the rescale_x_2 function described in the syuzhet documentation.

# rolling mean over each sentiment vector

wdw_1 <- round(length(sv_1)*.1)
rolled_1 <- zoo::rollmean(sv_1, k=wdw_1)
wdw_2 <- round(length(sv_2)*.1)
rolled_2 <- zoo::rollmean(sv_2, k=wdw_2)
wdw_3 <- round(length(sv_3)*.1)
rolled_3 <- zoo::rollmean(sv_3, k=wdw_3)


list_1 <- rescale_x_2(rolled_1)
list_2 <- rescale_x_2(rolled_2)
list_3 <- rescale_x_2(rolled_3)

sample_1 <- seq(1, length(list_1$x), by=round(length(list_1$x)/100))
sample_2 <- seq(1, length(list_2$x), by=round(length(list_2$x)/100))
sample_3 <- seq(1, length(list_3$x), by=round(length(list_3$x)/100))

#normalization for comparison

x1 <- 1:length(sv_1)
y1 <- sv_1
raw_1 <- loess(y1 ~ x1, span=.5)
line1 <- rescale(predict(raw_1))
x2 <- 1:length(sv_2)
y2 <- sv_2
raw_2 <- loess(y2 ~ x2, span=.5)
line2 <- rescale(predict(raw_2))
x3 <- 1:length(sv_3)
y3 <- sv_3
raw_3 <- loess(y3 ~ x3, span=.5)
line3 <- rescale(predict(raw_3))

sample_1 <- seq(1, length(line1), by=round(length(line1)/100))
sample_2 <- seq(1, length(line2), by=round(length(line2)/100))
sample_3 <- seq(1, length(line3), by=round(length(line3)/100))

plot(line1[sample_1], 
     type="l", 
     col="blue",
     xlab="Narrative Time (sampled)", 
     ylab="Emotional Valence"
     )
lines(line2[sample_2], col="orange")
lines(line3[sample_3], col="green")



legend(75, 1, legend=c(text1, text2, text3),
       col=c("blue", "orange", "green"), lty=1:1, cex=0.5,
       title="Movies", text.font=4, bg='white')

Interesting observations: Tangled is notable because it begins like the first film and ends with an arc very similar to the second.

Experiment on T3

The beginnings are fairly similar; then Peter Pan and Treasure Planet become opposites. The films made before 2000 are more agitated, with more oscillating curves.
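The curves discussed in these experiments were obtained by repeating the T2 pipeline for each cluster. That repetition can be wrapped in a small helper function; smooth_sentiment is a name of our own, and the sketch assumes the syuzhet package is loaded and the texts_df data frame from the chunks above is available:

```r
# Helper wrapping the T2 steps: sentence split, syuzhet scoring,
# loess smoothing and rescaling to [0, 1] for comparison.
# Assumes library(syuzhet) has been loaded and texts_df exists.
smooth_sentiment <- function(title, data = texts_df, span = .5) {
  text <- data[match(title, data$Title), "Text"]
  scores <- get_sentiment(get_sentences(text), method = "syuzhet")
  fitted <- loess(scores ~ seq_along(scores), span = span)
  rescale(predict(fitted))  # rescale() is provided by syuzhet
}

# Usage, e.g. for the T3 cluster:
# plot(smooth_sentiment("Peter_Pan"), type = "l", col = "blue")
# lines(smooth_sentiment("Treasure_Planet"), col = "orange")
```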

Experiment on T5

We note that the films with male protagonists tend to have similar curves and much more positive endings than those with female protagonists. The female-led films characteristically start on a positive note in the first fifth of the plot and end with a value at least 0.5 lower than their starting point.

Experiment on T9

The films produced after the year 2000 have endings that tend towards a negative sentiment value, whereas at the end of the 1990s the endings still remained positive. The most recent film has a more "complex" curve than the others. With the exception of Dinosaur, they all seem to start on a very positive note, although (Moana aside) Brother Bear and Tarzan also fall very quickly below zero, keeping negative values until the end of the film.

Experiment on T11

This comparison is interesting because the 2007 film has a neutral opening and drops immediately, while the 2014 film starts negatively but rises quickly and keeps a higher value on average. The two lines nonetheless remain very similar, and we can infer a positive connotation in the dialogue of these cartoons, possibly indicative of the cultural spirit of the early 2000s, which held a largely optimistic view of technological innovation.

Experiment on T13

Even accounting for the large disparity in the ratio of spoken text to music (Cinderella has a lot of text, the others mostly music), Bambi and Cinderella are much more similar to each other than to Make Mine Music. The latter is a compilation of ten short films, so its sentiment curve cannot be taken as representative of a plot, but rather of the editing done by the producers. This makes it more deliberate in terms of emotional value, since it does not have to follow a plot template set by a fairy tale, as happens for instance with Cinderella, which is subject to the constraints of its traditional fairy-tale source. It is thus exemplary of the influence of the artistic choices made by the producers and editors.

Experiment on T14

Although the two films start from (sometimes very) different values, both reach their peak of positivity halfway through, and from that point on they tend to follow a similar, descending trajectory.

Experiment on T15

All the films have very similar values both at the beginning and at the end. The two most recent ones show a comparable trajectory, with positive peaks towards the middle of the film preceded and followed by negative peaks, whereas the oldest one is much more linear: after a descent with a negative peak at the midpoint, it simply rises again until the end.

Conclusions

We extracted valuable insights for some of the clusters, while remaining aware of the limitations imposed by our knowledge in the field of data science and by the limited scope of our study.

Nevertheless, some interesting considerations emerged from the intersection of the collected data with our historical, social and cultural knowledge. These allowed us to draw conclusions about the trajectories of films with male versus female protagonists, and about the evolution of customs and society, as in the discussion of the future and new technologies.

We hope that this initial, exploratory study can serve as a starting point for future research in a field, the language of children's films, that in our view is little explored yet important: it plays a part in the development and education of young human beings, while also acting as a litmus test of the social trends of its era.


  1. See https://mimno.github.io/Mallet/topics↩︎